
chore(evals): Update model evaluations 2026-06-16 #138

Open
rhacs-bot wants to merge 1 commit into main from chore/update-model-evaluation-2026-06-16

@rhacs-bot

Contributor

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-16

This PR was automatically generated by the Model Evaluation workflow.

@rhacs-bot rhacs-bot requested a review from janisz as a code owner June 16, 2026 08:23
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Contributor


📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Updated evaluation results reflecting 100% task completion (11/11 tasks passed)
    • Previously failing tasks now passing with improved performance
    • Updated token metrics for the latest evaluation run

Walkthrough

The docs/model-evaluation.md file is updated to replace the prior gpt-5-mini evaluation entry (dated 2026-05-26) with a new entry dated 2026-06-16, showing 11/11 tasks passing (100%), updated per-task pass/fail statuses for rhsa-not-supported and cve-nonexistent, and revised total token counts.

Changes

gpt-5-mini Evaluation Results Update

| Layer / File(s) | Summary |
|---|---|
| Updated gpt-5-mini evaluation section (docs/model-evaluation.md) | Replaces the 2026-05-26 evaluation block with a 2026-06-16 block: overall pass rate updated to 11/11 (100%), rhsa-not-supported and cve-nonexistent tasks changed from failing to passing, and total input/output token counts updated. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: an automated update to model evaluation data for a specific date (2026-06-16). |
| Description check | ✅ Passed | The description is directly related to the changeset, providing context about the automated weekly evaluation update for the gpt-5-mini model. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@codecov-commenter

codecov-commenter commented Jun 16, 2026


❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 380 | 2 | 378 | 12 |
View the full list of 2 ❄️ flaky test(s)
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 46 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 46 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3


@coderabbitai coderabbitai Bot left a comment

Contributor


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/model-evaluation.md (1)

36-36: ⚠️ Potential issue | 🟠 Major

Clarify the actual task passing criterion — documentation at line 36 contradicts results table.

Line 36 states tasks pass when "all its assertions pass and the LLM judge approves." However, the results table shows rhsa-not-supported and cve-nonexistent marked as Pass despite failing the maxCalls assertion. Either the passing criterion at line 36 is incomplete, or the Result column should reflect the documented requirement of all assertions passing.
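
The two readings of the criterion can be contrasted in a short Python sketch. Everything here is hypothetical: `TaskResult`, `passes_strict`, and `passes_lenient` are illustrative names and are not taken from the evaluation harness, whose real logic is not shown in this PR. The lenient rule in particular is only one guess at what would produce a Pass with a failed maxCalls assertion.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task outcome; field names are assumptions."""
    name: str
    assertions_passed: int
    assertions_total: int
    judge_approved: bool


def passes_strict(result: TaskResult) -> bool:
    # Documented reading: every assertion passes AND the LLM judge approves.
    return (
        result.assertions_passed == result.assertions_total
        and result.judge_approved
    )


def passes_lenient(result: TaskResult) -> bool:
    # Assumed alternative: the judge's approval alone decides the result,
    # and failed assertions are reported but do not fail the task.
    return result.judge_approved


# A task with one failed assertion (for example maxCalls) that the judge
# is assumed to have approved.
example = TaskResult("cve-nonexistent", assertions_passed=2,
                     assertions_total=3, judge_approved=True)

print(passes_strict(example))   # False
print(passes_lenient(example))  # True
```

If the results table follows something like the lenient rule, the sentence at line 36 would need to say so; if the strict rule is intended, the Result column would need to change.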

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/model-evaluation.md` at line 36, Update the task passing criterion
statement at line 36 to accurately reflect the actual passing logic. The current
statement says all assertions must pass AND the LLM judge approves, but the
results table shows tasks like rhsa-not-supported and cve-nonexistent marked as
Pass despite failing the maxCalls assertion. Either clarify line 36 to document
the actual, more lenient passing criteria (if failing some assertions is
acceptable), or update the language to precisely explain which assertions are
required to pass versus which are optional for task completion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@docs/model-evaluation.md`:
- Line 36: Update the task passing criterion statement at line 36 to accurately
reflect the actual passing logic. The current statement says all assertions must
pass AND the LLM judge approves, but the results table shows tasks like
rhsa-not-supported and cve-nonexistent marked as Pass despite failing the
maxCalls assertion. Either clarify line 36 to document the actual, more lenient
passing criteria (if failing some assertions is acceptable), or update the
language to precisely explain which assertions are required to pass versus which
are optional for task completion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 13f07577-395c-4ef6-b99d-c085a2e2ec96

📥 Commits

Reviewing files that changed from the base of the PR and between 81ce9af and 3712f51.

📒 Files selected for processing (1)
  • docs/model-evaluation.md

@github-actions


E2E Test Results

Commit: 3712f51
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✗ cve-multiple (assertions: 2/3)
      one or more verification steps failed
      - ToolsUsed: Required tool not called: server=stackrox-mcp, tool=, pattern=get_deployments_for_cve
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ~ cve-nonexistent (assertions: 2/3)
      - MaxToolCalls: Too many tool calls: expected <= 5, got 9

Tasks:      10/11 passed (90.91%)
Assertions: 30/32 passed (93.75%)
Tokens:     ~57423 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  14748 tokens
  Output: 21299 tokens
Judge used tokens:
  Input:  19240 tokens
  Output: 21529 tokens
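
The task and assertion totals above can be re-derived from the per-task lines. The following is a minimal Python sketch, not the workflow's own reporting code: `summarize` and the regular expression are illustrative, and it assumes that the `~` marker counts as a passed task, which is an inference from the reported 10/11 rather than a documented rule.

```python
import re

# Per-task lines copied from the Evaluation Summary above.
SUMMARY = """\
✗ cve-multiple (assertions: 2/3)
✓ list-clusters (assertions: 3/3)
✓ cve-cluster-does-exist (assertions: 3/3)
✓ cve-clusters-general (assertions: 3/3)
✓ cve-cluster-does-not-exist (assertions: 3/3)
✓ cve-detected-clusters (assertions: 3/3)
✓ rhsa-not-supported (assertions: 2/2)
✓ cve-detected-workloads (assertions: 3/3)
✓ cve-cluster-list (assertions: 3/3)
✓ cve-log4shell (assertions: 3/3)
~ cve-nonexistent (assertions: 2/3)
"""

# Marker, task name, then "passed/total" assertion counts.
LINE = re.compile(
    r"^\s*(?P<mark>[✗✓~])\s+(?P<name>\S+)\s+"
    r"\(assertions:\s*(?P<ok>\d+)/(?P<total>\d+)\)"
)


def summarize(text: str) -> tuple[int, int, int, int]:
    """Return (tasks_passed, tasks_total, assertions_passed, assertions_total)."""
    tasks_passed = tasks_total = ok = total = 0
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip detail lines that are not task headers
        tasks_total += 1
        # Assumption: only "✗" is a failed task; "~" is counted as passed.
        if match["mark"] != "✗":
            tasks_passed += 1
        ok += int(match["ok"])
        total += int(match["total"])
    return tasks_passed, tasks_total, ok, total


tasks_passed, tasks_total, ok, total = summarize(SUMMARY)
print(f"Tasks:      {tasks_passed}/{tasks_total} ({100 * tasks_passed / tasks_total:.2f}%)")
print(f"Assertions: {ok}/{total} ({100 * ok / total:.2f}%)")
```

Under that assumption the sketch reproduces the reported 10/11 (90.91%) and 30/32 (93.75%). Counting `~` as a failure would instead give 9/11, so the reported total only holds if a task can pass with a failed MaxToolCalls assertion.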

